INTERSPEECH.2013 - Speech Synthesis

Total: 50

#1 Training an articulatory synthesizer with continuous acoustic data [PDF] [Copy] [Kimi1]

Authors: Santitham Prom-on ; Peter Birkholz ; Yi Xu

This paper reports preliminary results of our effort to address the acoustic-to-articulatory inversion problem. We tested an approach that simulates speech production acquisition as a distal learning task, with acoustic signals of natural utterances in the form of MFCCs as input, VocalTractLab, a 3D articulatory synthesizer controlled by target approximation models, as the learner, and stochastic gradient descent as the training method. The approach was tested on a number of natural utterances, and the results were highly encouraging.
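
A minimal sketch of the distal-learning idea above, not of the paper's actual system: a toy synthesize_mfcc map stands in for VocalTractLab, and the gradient of the MFCC distance is approximated by finite differences because the synthesizer is treated as a black box. All names, dimensions, and hyperparameters are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for the articulatory synthesizer (VocalTractLab in the paper):
# a fixed nonlinear map from target parameters to a short "MFCC" sequence.
_W = rng.normal(size=(10, 12))
def synthesize_mfcc(targets):
    return np.tanh(targets @ _W)            # shape: (frames, 12)

def mfcc_distance(a, b):
    """Frame-averaged Euclidean distance between two MFCC sequences."""
    n = min(len(a), len(b))
    return float(np.mean(np.linalg.norm(a[:n] - b[:n], axis=1)))

def distal_sgd(natural_mfcc, init_targets, lr=0.05, eps=1e-3, n_iter=500):
    """SGD on the synthesis targets; gradients come from finite differences
    because the synthesizer provides no analytic gradient."""
    targets = init_targets.copy()
    for _ in range(n_iter):
        base = mfcc_distance(synthesize_mfcc(targets), natural_mfcc)
        i = rng.integers(targets.size)       # perturb one random coordinate per step
        probe = targets.copy()
        probe.flat[i] += eps
        grad_i = (mfcc_distance(synthesize_mfcc(probe), natural_mfcc) - base) / eps
        targets.flat[i] -= lr * grad_i
    return targets

# Demo: recover hidden targets that produced a "natural" utterance.
true_targets = rng.normal(size=(8, 10))      # 8 frames x 10 target dimensions
natural = synthesize_mfcc(true_targets)
learned = distal_sgd(natural, rng.normal(size=true_targets.shape))
print(mfcc_distance(synthesize_mfcc(learned), natural))
```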

#2 Estimating speaker-specific intonation patterns using the linear alignment model [PDF] [Copy] [Kimi1]

Authors: Géza Kiss ; Jan P. H. van Santen

Modeling speaker-specific intonation is important in several areas, including speaker identification, verification, and imitation using text-to-speech synthesis. However, the choice of the intonation model and the estimation of its parameters from spontaneous speech remain a challenge. We propose a way to estimate speaker-specific intonation parameters for a particular superpositional model, the Simplified Linear Alignment Model, using robust per-utterance and overall statistics of spontaneous speech. We used this method to compare the intonation of children with autism or language impairment, who often have atypical speech prosody, with that of typically developing children. We found significant differences between the groups, which demonstrates the effectiveness of the proposed method.

#3 Factored maximum likelihood kernelized regression for HMM-based singing voice synthesis [PDF] [Copy] [Kimi1]

Authors: June Sig Sung ; Doo Hwa Hong ; Hyun Woo Koo ; Nam Soo Kim

In our previous work, we proposed factored maximum likelihood linear regression (FMLLR) adaptation, where each MLLR parameter is defined as a function of a control vector. In this paper, we introduce a novel technique called factored maximum likelihood kernelized regression (FMLKR) for HMM-based style adaptive speech synthesis. In FMLKR, nonlinear regression between the mean vectors of the base model and the corresponding mean vectors of the adaptation data is performed with a kernel method based on the FMLLR framework. In a series of experiments on artificial generation of singing voices, the proposed technique shows better performance than the other conventional methods.

#4 Improvements to HMM-based speech synthesis based on parameter generation with rich context models [PDF] [Copy] [Kimi1]

Authors: Shinnosuke Takamichi ; Tomoki Toda ; Yoshinori Shiga ; Sakriani Sakti ; Graham Neubig ; Satoshi Nakamura

In this paper, we improve parameter generation with rich context models by modifying the initialization method and further apply it to both spectral and F0 components in HMM-based speech synthesis. To alleviate the over-smoothing effects caused by traditional parameter generation methods, we previously proposed an iterative parameter generation method with rich context models. It has been reported that this method yields quality improvements in synthetic speech, but there are still limitations: 1) the generation method still suffers from the over-smoothing effect, as it uses the parameters generated by the traditional method as the initial parameters, which strongly affect the finally generated parameters, and 2) it is applied only to the spectral component. To address these issues, we propose 1) an initialization method that generates less smoothed but more discontinuous initial parameters, which tend to yield better generated parameters, and 2) a parameter generation method with rich context models for the F0 component. Experimental results show that the proposed methods yield significant improvements in the quality of synthetic speech.

#5 Voice conversion in high-order eigen space using deep belief nets [PDF] [Copy] [Kimi1]

Authors: Toru Nakashika ; Ryoichi Takashima ; Tetsuya Takiguchi ; Yasuo Ariki

This paper presents a voice conversion technique using Deep Belief Nets (DBNs) to build high-order eigen spaces of the source/target speakers, where it is easier to convert the source speech to the target speech than in the traditional cepstrum space. DBNs have a deep architecture that automatically discovers abstractions to maximally express the original input features. If we train the DBNs using only the speech of an individual speaker, it can be considered that there is less phonological information and relatively more speaker individuality in the output features at the highest layer. Training the DBNs for a source speaker and a target speaker, we can then connect and convert the speaker individuality abstractions using Neural Networks (NNs). The converted abstraction of the source speaker is then brought back to the cepstrum space using an inverse process of the DBNs of the target speaker. We conducted speaker-voice conversion experiments and confirmed the efficacy of our method with respect to subjective and objective criteria, comparing it with the conventional Gaussian Mixture Model-based method.
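
A rough sketch of this pipeline under strong simplifications: a single scikit-learn BernoulliRBM per speaker stands in for the stacked DBNs, an MLP connects the two hidden ("high-order eigen") spaces, and one top-down sigmoid pass approximates the inverse of the target DBN. The toy data, layer sizes, and the assumption of time-aligned parallel frames are illustrative, not taken from the paper.

```python
import numpy as np
from scipy.special import expit
from sklearn.neural_network import BernoulliRBM, MLPRegressor
from sklearn.preprocessing import MinMaxScaler

# Toy parallel cepstral frames (in practice: aligned cepstra of source/target speakers).
rng = np.random.default_rng(0)
src_cep = rng.normal(size=(2000, 24))
tgt_cep = 0.8 * src_cep + 0.2 * rng.normal(size=src_cep.shape)

# RBMs expect inputs in [0, 1]; the paper stacks several layers per speaker,
# a single layer is used here to keep the sketch short.
src_scaler, tgt_scaler = MinMaxScaler(), MinMaxScaler()
src01 = src_scaler.fit_transform(src_cep)
tgt01 = tgt_scaler.fit_transform(tgt_cep)

rbm_src = BernoulliRBM(n_components=64, learning_rate=0.05, n_iter=20, random_state=0).fit(src01)
rbm_tgt = BernoulliRBM(n_components=64, learning_rate=0.05, n_iter=20, random_state=0).fit(tgt01)

# Neural network connecting the two speakers' high-order spaces.
mlp = MLPRegressor(hidden_layer_sizes=(64,), max_iter=500, random_state=0)
mlp.fit(rbm_src.transform(src01), rbm_tgt.transform(tgt01))

def convert(frames):
    h_src = rbm_src.transform(src_scaler.transform(frames))
    h_tgt = mlp.predict(h_src)
    # Approximate inverse of the target RBM: one top-down sigmoid pass.
    v01 = expit(h_tgt @ rbm_tgt.components_ + rbm_tgt.intercept_visible_)
    return tgt_scaler.inverse_transform(v01)

converted = convert(src_cep[:10])
print(converted.shape)   # (10, 24)
```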

#6 Voice conversion for non-parallel datasets using dynamic kernel partial least squares regression [PDF] [Copy] [Kimi1]

Authors: Hanna Silén ; Jani Nurminen ; Elina Helander ; Moncef Gabbouj

Voice conversion aims at converting speech from one speaker to sound as if it were spoken by another specific speaker. The most popular voice conversion approach, based on Gaussian mixture modeling, tends to suffer either from model overfitting or oversmoothing. To overcome the shortcomings of the traditional approach, we recently proposed to use dynamic kernel partial least squares (DKPLS) regression in the framework of parallel-data voice conversion. However, the availability of parallel training data from both the source and target speaker is not always guaranteed. In this paper, we extend the DKPLS-based conversion approach to non-parallel data by combining it with the well-known INCA alignment algorithm. The listening test results indicate that high-quality conversion can be achieved with the proposed combination. Furthermore, the performance of two variations of INCA is evaluated with both intra-lingual and cross-lingual data.
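
A compact sketch of the iterative alignment idea (in the spirit of INCA) for non-parallel frames: alternate nearest-neighbour pairing with re-estimation of the conversion function. A plain ridge regression stands in for the DKPLS mapping used in the paper; data and settings are illustrative.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.linear_model import Ridge

def inca_align(src, tgt, n_iter=10):
    """Iterative nearest-neighbour alignment for non-parallel source/target frames.
    A linear Ridge map stands in for the DKPLS regression used in the paper."""
    mapper = Ridge(alpha=1.0)
    converted = src.copy()
    pairs = None
    for _ in range(n_iter):
        # 1) Align: for each converted source frame, find the closest target frame.
        nn = NearestNeighbors(n_neighbors=1).fit(tgt)
        idx = nn.kneighbors(converted, return_distance=False).ravel()
        pairs = (src, tgt[idx])
        # 2) Re-estimate the conversion function on the aligned pairs.
        mapper.fit(pairs[0], pairs[1])
        # 3) Update the converted frames and iterate.
        converted = mapper.predict(src)
    return mapper, pairs

# Toy non-parallel data: different numbers of frames per speaker.
rng = np.random.default_rng(0)
src = rng.normal(size=(600, 20))
tgt = rng.normal(size=(800, 20)) + 1.0
mapper, _ = inca_align(src, tgt)
```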

#7 A style control technique for singing voice synthesis based on multiple-regression HSMM [PDF] [Copy] [Kimi1]

Authors: Takashi Nose ; Misa Kanemoto ; Tomoki Koriyama ; Takao Kobayashi

This paper proposes a technique for controlling singing style in HMM-based singing voice synthesis. A style control technique based on the multiple-regression HSMM (MRHSMM), originally proposed for HMM-based expressive speech synthesis, is applied to the conventional technique. The idea of pitch-adaptive training is introduced into the MRHSMM to improve the modeling accuracy of the fundamental frequency (F0) associated with notes. A robust vibrato modeling technique based on a moving average filter is also proposed to reproduce a natural-sounding vibrato expression even when the vibrato of the original singing voice is unclear. Subjective evaluation results show that users can intuitively control singing style while keeping the naturalness of the synthetic voice.
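
A minimal sketch of the moving-average idea for vibrato: smooth the F0 contour to isolate the note-level melody, then re-impose a clean sinusoidal vibrato on top. The frame rate, window length, and vibrato rate/depth below are assumed values, not the paper's.

```python
import numpy as np

def moving_average(x, win):
    kernel = np.ones(win) / win
    return np.convolve(x, kernel, mode="same")

def regularize_vibrato(f0_cents, frame_rate=200.0, win_ms=250, rate_hz=6.0, depth_cents=50.0):
    """Separate vibrato from the note-level melody with a moving-average filter,
    then re-impose a clean sinusoidal vibrato (illustrative parameter values)."""
    win = int(win_ms * 1e-3 * frame_rate) | 1            # odd window length
    melody = moving_average(f0_cents, win)                # slow note/portamento component
    t = np.arange(len(f0_cents)) / frame_rate
    vibrato = depth_cents * np.sin(2 * np.pi * rate_hz * t)
    return melody + vibrato

# Toy F0 contour (in cents) with an irregular, weak vibrato.
rng = np.random.default_rng(0)
t = np.arange(0, 2.0, 1 / 200.0)
f0 = 6000 + 20 * np.sin(2 * np.pi * 5.5 * t) * (0.5 + 0.5 * rng.random(len(t)))
regularized_f0_cents = regularize_vibrato(f0)
```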

#8 Predicting the quality of text-to-speech systems from a large-scale feature set [PDF] [Copy] [Kimi1]

Authors: Florian Hinterleitner ; Christoph R. Norrenbrock ; Sebastian Möller ; Ulrich Heute

We extract 1495 speech features from 2 subjectively evaluated text-to-speech (TTS) databases. These features are derived from pitch, loudness, MFCCs, spectral measures, formants, and intensity. The speech material is synthesized using up to 15 different TTS systems, some of them with up to 8 different voices. We develop quality predictors for TTS signals following two different approaches to handle the huge set of speech features: a three-step feature selection followed by stepwise multiple linear regression, and an approach based on support vector machines. The predictors are cross-validated via 3-fold cross validation (CV) and leave-one-test-out (LOTO) CV. Due to the high number of features, we apply a strict CV method where the partitioning is realized prior to the feature scaling and feature selection steps. For comparison, we also follow a semi-strict approach where the partitioning effectively takes place after these steps. In the 3-fold CV case we achieve correlations as high as .75 for strict CV and .89 for semi-strict CV. The more ambitious LOTO CV yields correlations around .80 for the male speakers, whereas the results for the female voices show the need for improvement.
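
A small sketch of the "strict" cross-validation protocol described above, assuming scikit-learn: scaling and feature selection are placed inside the pipeline so they are re-fit on each training fold only, never on the test fold. The toy data, the SVR settings, and the choice of 50 selected features are illustrative.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.svm import SVR
from sklearn.model_selection import KFold, cross_val_predict

# Toy stand-in for the real data: ~1495 acoustic features per stimulus and a
# subjective quality rating. Shapes and names are illustrative only.
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 1495))
y = X[:, :5].sum(axis=1) + 0.5 * rng.normal(size=120)

# Strict CV: partitioning happens before scaling/selection because both steps
# live inside the pipeline and are re-fit per training fold.
model = Pipeline([
    ("scale", StandardScaler()),
    ("select", SelectKBest(f_regression, k=50)),
    ("svr", SVR(kernel="rbf", C=1.0)),
])
pred = cross_val_predict(model, X, y, cv=KFold(n_splits=3, shuffle=True, random_state=0))
print("3-fold CV correlation:", np.corrcoef(y, pred)[0, 1])
```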

#9 Speaker-specific retraining for enhanced compression of unit selection text-to-speech databases [PDF] [Copy] [Kimi1]

Authors: Jani Nurminen ; Hanna Silén ; Moncef Gabbouj

Unit selection based text-to-speech systems can generally obtain high speech quality provided that the database is large enough. In embedded applications, the related memory requirements may be excessive and often the database needs to be both pruned and compressed to fit it into the available memory space. In this paper, we study the topic of database compression. In particular, the focus is on speaker-specific optimization of the quantizers used in the database compression. First, we introduce the simple concept of dynamic quantizer structures, facilitating the use of speaker-specific optimizations by enabling convenient run-time updates. Second, we show that significant memory savings can be obtained through speaker-specific retraining while perfectly maintaining the quantization accuracy, even when the memory required for the additional codebook data is taken into account. Thus, the proposed approach can be considered effective in reducing the conventionally large footprint of unit selection based text-to-speech systems.
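
A toy illustration of the speaker-specific retraining idea: compare the quantization distortion of a generic codebook against a smaller codebook re-trained on the target speaker's data. The codebook sizes, feature dimension, and plain k-means training are stand-ins for the paper's actual quantizer structures.

```python
import numpy as np
from sklearn.cluster import KMeans

def distortion(codebook, frames):
    """Mean squared quantization error of frames against a codebook."""
    d = np.linalg.norm(frames[:, None, :] - codebook[None, :, :], axis=2)
    return float(np.mean(d.min(axis=1) ** 2))

rng = np.random.default_rng(0)
multi_speaker = rng.normal(size=(5000, 13))               # generic training material
speaker_db = rng.normal(loc=0.3, size=(2000, 13))         # the voice to be compressed

generic = KMeans(n_clusters=256, n_init=4, random_state=0).fit(multi_speaker).cluster_centers_
# Speaker-specific retraining: a smaller codebook can approach the generic one's
# accuracy on this speaker, which is where the memory saving comes from.
specific = KMeans(n_clusters=128, n_init=4, random_state=0).fit(speaker_db).cluster_centers_

print("generic 256-entry codebook  :", distortion(generic, speaker_db))
print("retrained 128-entry codebook:", distortion(specific, speaker_db))
```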

#10 Avatar therapy: an audio-visual dialogue system for treating auditory hallucinations [PDF] [Copy] [Kimi1]

Authors: Mark Huckvale ; Julian Leff ; Geoff Williams

This paper presents a radical new therapy for persecutory auditory hallucinations ("voices") which are most commonly found in serious mental illnesses such as schizophrenia. In around 30% of patients these symptoms are not alleviated by anti-psychotic medication. This work is designed to tackle the problem created by the inaccessibility of the patients' experience of voices to the clinician. Patients are invited to create an external representation of their dominant voice hallucination using computer speech and animation technology. Customised graphics software is used to create an avatar that gives a face to the voice, while voice morphing software realises it in audio, in real time. The therapist then conducts a dialogue between the avatar and the patient, with a view to gradually bringing the avatar, and ultimately the hallucinatory voice, under the patient's control. Results of a pilot study reported elsewhere indicate that the approach has potential for dramatic improvements in patient control of the voices after a series of only six short sessions. The focus of this paper is on the audio-visual speech technology which delivers the central aspects of the therapy.

#11 Optimizations and fitting procedures for the liljencrants-fant model for statistical parametric speech synthesis [PDF] [Copy] [Kimi1]

Authors: Prasanna Kumar Muthukumar ; Alan W. Black ; H. Timothy Bunnell

Every parametric speech synthesizer requires a good excitation model to produce speech that sounds natural. In this paper, we describe efforts toward building one such model using the Liljencrants-Fant (LF) model. We used the Iterative Adaptive Inverse Filtering technique to derive an initial estimate of the glottal flow derivative (GFD). Candidate pitch periods in the estimated GFD were then located and LF model parameters estimated using a gradient descent optimization algorithm. Residual energy in the GFD, after subtracting the fitted LF signal, was then modeled by a 4-term LPC model plus an energy term to extend the excitation model and account for source information not captured by the LF model. The ClusterGen speech synthesizer was then trained to predict these excitation parameters from text so that the excitation model could be used for speech synthesis. ClusterGen excitation predictions were further used to reinitialize the excitation fitting process and iteratively improve the fit by including modeled voicing and segmental influences on the LF parameters. The results of all of these methods have been confirmed using both listening tests and objective metrics.
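
A simplified sketch of fitting an LF-like pulse to one pitch period of the glottal flow derivative with a least-squares optimizer. It omits the LF model's implicit constraints (epsilon is approximated as 1/Ta and no area balance is enforced) and uses assumed initial values and bounds, so it illustrates the fitting step rather than reproducing the paper's procedure.

```python
import numpy as np
from scipy.optimize import least_squares

def lf_pulse(params, t, tc):
    """Simplified LF glottal-flow-derivative pulse over one period [0, tc]."""
    e0, alpha, tp, te, ta = params
    wg = np.pi / tp
    open_phase = e0 * np.exp(alpha * t) * np.sin(wg * t)
    ee = -e0 * np.exp(alpha * te) * np.sin(wg * te)       # magnitude of the negative peak
    eps = 1.0 / ta                                         # crude approximation of epsilon
    decay = np.exp(np.clip(-eps * (t - te), -60.0, 60.0))
    tail = np.exp(np.clip(-eps * (tc - te), -60.0, 60.0))
    ret = -(ee / (eps * ta)) * (decay - tail)              # return phase
    return np.where(t <= te, open_phase, ret)

def fit_lf(gfd_period, fs):
    """Fit the simplified pulse to one pitch period of an (IAIF-estimated) GFD."""
    tc = len(gfd_period) / fs
    t = np.arange(len(gfd_period)) / fs
    x0 = np.array([1.0, 50.0, 0.4 * tc, 0.6 * tc, 0.05 * tc])      # rough initial guess
    lb = [1e-3, 0.0, 0.05 * tc, 0.2 * tc, 0.005 * tc]
    ub = [10.0, 500.0, 0.6 * tc, 0.9 * tc, 0.2 * tc]
    res = least_squares(lambda p: lf_pulse(p, t, tc) - gfd_period, x0, bounds=(lb, ub))
    residual = gfd_period - lf_pulse(res.x, t, tc)
    return res.x, residual          # residual would feed the 4-term LPC stage

# Toy example: fit a pulse generated from known parameters.
fs, tc = 16000, 0.008
t = np.arange(int(tc * fs)) / fs
target = lf_pulse(np.array([2.0, 100.0, 0.3 * tc, 0.65 * tc, 0.03 * tc]), t, tc)
params, resid = fit_lf(target, fs)
print(params)
```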

#12 Analysis and modeling of “focus” in context [PDF] [Copy] [Kimi1]

Authors: Dirk Hovy ; Gopala Krishna Anumanchipalli ; Alok Parlikar ; Caroline Vaughn ; Adam Lammert ; Eduard Hovy ; Alan W. Black

This paper uses a crowd-sourced definition of a speech phenomenon we have called "focus". Given sentences, text and speech, in isolation and in context, we asked annotators to identify what we term the "focus" word. We present their consistency in identifying the focused word, when presented with text or speech stimuli. We then build models to show how well we predict that focus word from lexical (and higher) level features. Also, using spectral and prosodic information, we show the differences in these focus words when spoken with and without context. Finally, we show how we can improve speech synthesis of these utterances given focus information.

#13 Probabilistic speech F0 contour model incorporating statistical vocabulary model of phrase-accent command sequence [PDF] [Copy] [Kimi1]

Authors: Tatsuma Ishihara ; Hirokazu Kameoka ; Kota Yoshizato ; Daisuke Saito ; Shigeki Sagayama

We have previously proposed a generative model of speech F0 contours, based on the discrete-time version of the Fujisaki model (a model of the mechanism for controlling F0s through laryngeal muscles). One advantage of this model is that it allows us to apply statistical methods to estimate the Fujisaki-model parameters from speech F0 contours. This paper proposes a new generative model of speech F0 contours incorporating a vocabulary model of intonation patterns. A parameter inference algorithm for the present model is derived. We quantitatively evaluated the performance of our parameter inference algorithm.
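
For reference, a minimal implementation of Fujisaki-style F0 contour generation that the above model builds on: log F0 is a baseline plus phrase-command and accent-command responses. The statistical vocabulary model over command sequences, which is the paper's contribution, is not sketched here; command times, amplitudes, and the alpha/beta/gamma constants are illustrative.

```python
import numpy as np

def phrase_component(t, t0, ap, alpha=3.0):
    """Response of the phrase control mechanism to an impulse command at t0."""
    tau = np.maximum(t - t0, 0.0)
    return ap * (alpha ** 2) * tau * np.exp(-alpha * tau)

def accent_component(t, t1, t2, aa, beta=20.0, gamma=0.9):
    """Response of the accent control mechanism to a pedestal command on [t1, t2]."""
    def step(tau):
        tau = np.maximum(tau, 0.0)
        return np.minimum(1.0 - (1.0 + beta * tau) * np.exp(-beta * tau), gamma)
    return aa * (step(t - t1) - step(t - t2))

def fujisaki_f0(t, fb, phrase_cmds, accent_cmds):
    """ln F0(t) = ln Fb + sum of phrase responses + sum of accent responses."""
    lnf0 = np.full_like(t, np.log(fb))
    for t0, ap in phrase_cmds:
        lnf0 += phrase_component(t, t0, ap)
    for t1, t2, aa in accent_cmds:
        lnf0 += accent_component(t, t1, t2, aa)
    return np.exp(lnf0)

# One phrase command and two accent commands over a 2-second utterance.
t = np.linspace(0.0, 2.0, 400)
f0 = fujisaki_f0(t, fb=110.0,
                 phrase_cmds=[(0.0, 0.4)],
                 accent_cmds=[(0.3, 0.7, 0.3), (1.1, 1.5, 0.25)])
```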

#14 Reconstruction of continuous voiced speech from whispers [PDF] [Copy] [Kimi1]

Authors: Ian Vince McLoughlin ; Jingjie Li ; Yan Song

Whispers are an important secondary vocal communication mechanism that can be necessary for communicating private information and are an integral aspect of natural human-to-human dialogue. Furthermore, they may be the primary communication method of those suffering from certain forms of aphonia, such as laryngectomees. This paper considers the conversion of continuous whispers to natural-sounding speech, and proposes a new reconstruction method based upon the synthesis of individual formants as the excitation source, followed by artificial glottal modulation. Early results show that the proposed method can improve quality and intelligibility over the original whispers when evaluated on continuous speech. It requires neither a priori nor speaker-dependent information, is of relatively low complexity, and is suitable for real-time processing.
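
A very rough sketch of the two stages named above, under many simplifications: LPC-based formant estimation on a whispered frame, sinusoidal excitation at those formant frequencies, and an artificial pitch-synchronous (glottal) amplitude modulation. The pulse shape, constant F0, and frame handling are assumptions; the paper's actual reconstruction method is more elaborate.

```python
import numpy as np
from scipy.linalg import solve_toeplitz

def lpc(frame, order=12):
    """Autocorrelation-method LPC coefficients (a[0] = 1)."""
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    a = solve_toeplitz((r[:order], r[:order]), r[1:order + 1])
    return np.concatenate(([1.0], -a))

def formant_freqs(frame, fs, order=12, n_formants=3):
    """Pick the lowest positive-frequency roots of the LPC polynomial."""
    roots = np.roots(lpc(frame, order))
    roots = roots[np.imag(roots) > 1e-3]
    freqs = np.sort(np.angle(roots) * fs / (2 * np.pi))
    return freqs[:n_formants]

def reconstruct_frame(whisper_frame, fs, f0=120.0):
    """Excite the frame's formant frequencies with sinusoids, then impose an
    artificial pitch-synchronous amplitude modulation."""
    t = np.arange(len(whisper_frame)) / fs
    excitation = sum(np.sin(2 * np.pi * f * t) for f in formant_freqs(whisper_frame, fs))
    glottal = 0.5 * (1.0 - np.cos(2 * np.pi * f0 * t)) ** 2    # crude glottal pulse train
    out = excitation * glottal
    return out * (np.std(whisper_frame) / (np.std(out) + 1e-9))

# Toy whispered frame: windowed noise standing in for real whispered speech.
fs = 16000
rng = np.random.default_rng(0)
frame = rng.normal(size=480) * np.hanning(480)
voiced_frame = reconstruct_frame(frame, fs)
```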

#15 Generating fundamental frequency contours for speech synthesis in yorùbá [PDF] [Copy] [Kimi1]

Authors: Daniel R. van Niekerk ; Etienne Barnard

We present methods for modelling and synthesising fundamental frequency (F0) contours suitable for application in text-to-speech (TTS) synthesis of Yoruba (an African tone language). These methods are discussed and compared with a baseline approach using the HMM-based speech synthesis system HTS. Evaluation is done by comparing ten-fold cross validation squared errors on a small corpus of four speakers. We show that the proposed methods are relatively effective at modelling and generating F0 contours in this context, achieving lower error rates than the baseline. These results suggest that our methods will be useful for the generation of improved synthesis of tone in African languages, which has been a challenge to date.

#16 Real-time voice conversion using artificial neural networks with rectified linear units [PDF] [Copy] [Kimi]

Authors: Elias Azarov ; Maxim Vashkevich ; Denis Likhachov ; Alexander Petrovsky

This paper presents an approach to parametric voice conversion that can be used in real-time entertainment applications. The approach is based on spectral mapping using an artificial neural network (ANN) with rectified linear units (ReLU). To overcome the over-smoothing problem, a special network configuration is proposed that utilizes temporal states of the speaker. The speech is represented using the harmonic plus noise model, and the parameters of the model are estimated using instantaneous harmonic parameters. Using objective and subjective measures, the proposed voice conversion technique is compared to the main alternative approaches.
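
A small sketch of ReLU-based spectral mapping, assuming scikit-learn: neighbouring frames are stacked as input so the network sees some temporal context, a crude stand-in for the paper's temporal-state configuration. Data shapes and network sizes are illustrative, and no train/test split is shown.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

def add_context(frames, width=2):
    """Stack +/- `width` neighbouring frames so the network sees temporal context."""
    pad = np.pad(frames, ((width, width), (0, 0)), mode="edge")
    return np.hstack([pad[i:i + len(frames)] for i in range(2 * width + 1)])

# Toy time-aligned source/target spectral envelopes (e.g. harmonic amplitudes).
rng = np.random.default_rng(0)
src = rng.normal(size=(3000, 40))
tgt = np.roll(src, 1, axis=0) * 0.7 + 0.3 * rng.normal(size=src.shape)

X = add_context(src, width=2)
net = MLPRegressor(hidden_layer_sizes=(128, 128), activation="relu",
                   max_iter=200, random_state=0)
net.fit(X, tgt)
converted = net.predict(add_context(src[:100], width=2))
```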

#17 Generation of fundamental frequency contours for Thai speech synthesis using tone nucleus model [PDF] [Copy] [Kimi1]

Authors: Oraphan Krityakien ; Keikichi Hirose ; Nobuaki Minematsu

As a classic and intrinsic requirement, synthetic speech needs to convey correct information to listeners with a good degree of naturalness. Fundamental frequency (F0) contours need to be controlled to meet these requirements. Tonal languages pose additional challenges because the F0 contour affects both the intelligibility and the naturalness of the speech. Based on the fact that the F0 contour within a syllable conveys information asymmetrically, the tone nucleus model has been successfully established. In this study, the tone nucleus model is applied to generate F0 contours for Thai speech synthesis; this is among the first work to introduce the model to a tonal language other than Mandarin. Tone nuclei for the five distinctive tones are defined according to their underlying targets. The full process of F0 contour generation is presented, from tone nucleus extraction to F0 contour generation for continuous speech. The efficiency and adaptability of the model for Thai were confirmed by objective and subjective tests: the model outperformed a baseline that did not use it, and the generated F0 contours showed less distortion, higher tone intelligibility, and greater naturalness. A modified method is also introduced for further enhancement, and the results showed significant improvement in the generated F0 contours.

#18 Unsupervised speaker and expression factorization for multi-speaker expressive synthesis of ebooks [PDF] [Copy] [Kimi1]

Authors: Langzhou Chen ; Norbert Braunschweiler

This work aims to improve expressive speech synthesis of ebooks for multiple speakers by using training data from many audiobooks. Audiobooks contain a wide variety of expressive speaking styles which are often impractical to annotate. However, the speaker-expression factorization (SEF) framework, which has proven to be a powerful tool in speaker and expression modelling, usually requires supervised information about the expressions in the training data. This work presents an unsupervised SEF method which performs SEF on unlabelled training data in the framework of cluster adaptive training (CAT). The proposed method integrates expression clustering and parameter estimation into a single process that maximizes the likelihood of the training data. Experimental results indicate that it outperforms a cascade system of expression clustering and supervised SEF, and significantly improves the expressiveness of the synthetic speech of different speakers.

#19 Which resemblance is useful to predict phrase boundary rise labels for Japanese expressive text-to-speech synthesis, numerically-expressed stylistic or distribution-based semantic? [PDF] [Copy] [Kimi1]

Authors: Hideharu Nakajima ; Hideyuki Mizuno ; Osamu Yoshioka ; Satoshi Takahashi

To establish expressive text-to-speech synthesis, current research studies both the processing of input text and the rendering of natural expressive speech. Focusing on the former as a front-end task in the production of synthetic speech, this paper investigates a novel feature for predicting phrase boundary tone labels, which transcribe local fundamental frequency (F0) changes that frequently appear at phrase-end positions in expressive speech. To this end, we examined distribution-based semantic features consisting of i) word surface strings, ii) their part-of-speech tags taken from a phrase, and iii) the presence or absence of a pause at the final position of the phrase, which differ from conventional numerically-expressed stylistic features such as the positions, lengths, and distances of phrases. Through experiments on Japanese expressive speech, such as conversational speech and advertisement speech, we confirmed that the proposed features attain performance equal to or better than the conventional features. These results suggest that the distribution-based semantic features might be useful for predicting phrase boundary rise labels for conversational speech, and might be as useful as the conventional numerically-expressed stylistic features for advertisement speech.

#20 A targets-based superpositional model of fundamental frequency contours applied to HMM-based speech synthesis [PDF] [Copy] [Kimi1]

Authors: Jinfu Ni ; Yoshinori Shiga ; Chiori Hori ; Yutaka Kidawara

The superpositional model of fundamental frequency (F0) contours, as suggested by the Fujisaki model, can represent F0 movements of speech well while keeping a clear relation with the linguistic information of utterances. Therefore, improvement of HMM-based speech synthesis can be expected by exploiting the merits of the superpositional model. In this paper, a targets-based superpositional model is proposed in the light of the Fujisaki model. Both accent and phrase components are parameterized by respectively defined low and high targets, which allow flexible interaction between accent and phrase components. Owing to this flexible interaction, the new method consistently handles such complex F0 movements as low dipping, varying declination, and final lowering simply by adjusting parameter values. This facilitates extraction of the model parameters from observed F0 contours, which is one of the major problems preventing the use of the Fujisaki model. Extraction of the target parameters is evaluated on a Japanese speech corpus, and the F0 contours generated by the model are used for HMM training instead of the original contours. A listening test of synthetic speech indicates significant improvements in speech quality. Micro-prosodic effects are also investigated; the results show that adding micro-prosody to the generated F0 contours does not significantly improve speech quality.

#21 An investigation of acoustic features for singing voice conversion based on perceptual age [PDF] [Copy] [Kimi1]

Authors: Kazuhiro Kobayashi ; Hironori Doi ; Tomoki Toda ; Tomoyasu Nakano ; Masataka Goto ; Graham Neubig ; Sakriani Sakti ; Satoshi Nakamura

In this paper, we investigate the acoustic features that can be modified to control the perceptual age of a singing voice. Singers can sing expressively by controlling prosody and vocal timbre, but the varieties of voices that singers can produce are limited by physical constraints. Previous work has attempted to overcome this limitation through the use of statistical voice conversion. This technique makes it possible to convert the singing voice characteristics of an arbitrary source singer into those of an arbitrary target singer. However, it is still difficult to intuitively control singing voice characteristics by manipulating parameters corresponding to specific physical traits, such as gender and age. In this paper, we focus on controlling the perceived age of the singer and, as a first step, perform an investigation of the factors that play a part in the listener's perception of the singer's age. The experimental results demonstrate that 1) the perceptual age of singing voices corresponds relatively well to the actual age of the singer, 2) speech analysis/synthesis processing and statistical voice conversion processing do not adversely affect the perceptual age of singing voices, and 3) prosodic features have a larger effect on the perceptual age than spectral features.

#22 Effect of MPEG audio compression on HMM-based speech synthesis [PDF] [Copy] [Kimi1]

Authors: Bajibabu Bollepalli ; Tuomo Raitio ; Paavo Alku

In this paper, the effect of MPEG audio compression on HMM-based speech synthesis is studied. Speech signals are encoded with various compression rates and analyzed using the GlottHMM vocoder. Objective evaluation results show that the vocoder parameters start to degrade from encoding with bit-rates of 32 kbit/s or less, which is also confirmed by the subjective evaluation of the vocoder analysis-synthesis quality. Experiments with HMM-based speech synthesis show that the subjective quality of a synthetic voice trained with 32 kbit/s speech is comparable to a voice trained with uncompressed speech, but lower bit rates induce clear degradation in quality.

#23 Evaluation of a singing voice conversion method based on many-to-many eigenvoice conversion [PDF] [Copy] [Kimi1]

Authors: Hironori Doi ; Tomoki Toda ; Tomoyasu Nakano ; Masataka Goto ; Satoshi Nakamura

In this paper, we evaluate our proposed singing voice conversion method from various perspectives. To enable singers to freely control the voice timbre of their singing voice, we have proposed a singing voice conversion method based on many-to-many eigenvoice conversion (EVC) that makes it possible to convert the voice timbre of an arbitrary source singer into that of another arbitrary target singer using a probabilistic model. Furthermore, to easily develop training data consisting of multiple parallel data sets between a single reference singer and many other singers, a technique for efficiently and effectively generating the parallel data sets from non-parallel singing voice data sets of many singers using a singing-to-singing synthesis system has been proposed. However, the effectiveness of these proposed methods has not yet been sufficiently investigated. In this paper, we conduct both objective and subjective evaluations to carefully investigate the effectiveness of the proposed methods. Moreover, the differences between singing voice conversion and speaking voice conversion are also analyzed. Experimental results show that our proposed method succeeds in enabling people to control their own voice timbre using only an extremely small amount of the target singing voice.

#24 Statistical nonparametric speech synthesis using sparse Gaussian processes [PDF] [Copy] [Kimi1]

Authors: Tomoki Koriyama ; Takashi Nose ; Takao Kobayashi

This paper proposes a statistical nonparametric speech synthesis technique based on sparse Gaussian process regression (GPR). In our previous study, we proposed GPR-based speech synthesis where each frame of the synthesis units is modeled by Gaussian process regression. Preliminary experiments on synthesizing several phones, including both vowels and consonants, showed the potential of the technique. In this paper, the previous work is extended to full-sentence speech synthesis using sparse GPs and context modification. Specifically, cluster-based sparse Gaussian processes such as local GPs and the partially independent conditional (PIC) approximation are examined as a computationally feasible approach. Moreover, the frame-level context is extended to include not only a position context from the current phone but also the adjacent phones, to generate smoothly changing speech parameters. Objective and subjective evaluation results show that the proposed technique outperforms HMM-based speech synthesis with minimum generation error training.
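
A minimal sketch of the local-GP flavour of cluster-based sparse GPs mentioned above: cluster the frame-level inputs with k-means, fit one exact GP per cluster, and answer each query with the GP of its nearest cluster. Kernel, cluster count, and toy data are illustrative; the PIC approximation is not shown.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

class LocalGPRegressor:
    """Local GPs: one exact GP per k-means cluster, queries routed to the
    nearest cluster. A simple member of the cluster-based sparse-GP family."""
    def __init__(self, n_clusters=8):
        self.km = KMeans(n_clusters=n_clusters, n_init=4, random_state=0)
        self.gps = []

    def fit(self, X, y):
        labels = self.km.fit_predict(X)
        kernel = RBF(length_scale=1.0) + WhiteKernel(noise_level=0.1)
        self.gps = []
        for c in range(self.km.n_clusters):
            gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True)
            gp.fit(X[labels == c], y[labels == c])
            self.gps.append(gp)
        return self

    def predict(self, X):
        labels = self.km.predict(X)
        out = np.empty(len(X))
        for c in np.unique(labels):
            out[labels == c] = self.gps[c].predict(X[labels == c])
        return out

# Toy data: frame-level context features -> one speech parameter dimension.
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(2000, 5))
y = np.sin(3 * X[:, 0]) + 0.5 * X[:, 1] + 0.05 * rng.normal(size=2000)
model = LocalGPRegressor(n_clusters=8).fit(X, y)
print(model.predict(X[:5]))
```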

#25 Hybrid nearest-neighbor/cluster adaptive training for rapid speaker adaptation in statistical speech synthesis systems [PDF] [Copy] [Kimi1]

Authors: Amir Mohammadi ; Cenk Demiroglu

The statistical speech synthesis (SSS) approach has become one of the most popular methods in the speech synthesis field. An advantage of the SSS approach is the ability to adapt to a target speaker with a couple of minutes of adaptation data. However, many applications, especially in consumer electronics, require adaptation with only a few seconds of data, which can be done using eigenvoice adaptation techniques. Although such techniques work well in speech recognition, they are known to generate perceptual artifacts in statistical speech synthesis. Here, we propose two methods to both alleviate those quality problems and improve the speaker similarity obtained with the baseline eigenvoice adaptation algorithm. Our first method uses a Bayesian approach to constrain the eigenvoice adaptation algorithm to move in realistic directions in the speaker space, reducing artifacts. Our second method finds a reference speaker that is close to the target speaker and uses that reference speaker as the seed model in a second eigenvoice adaptation step. Both techniques performed significantly better than the baseline eigenvoice method in subjective quality and similarity tests.
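
A toy sketch of the second idea (choosing a reference speaker as the seed model), assuming an eigenvoice space built by PCA over speaker supervectors: project a crude target estimate into that space and pick the closest training speaker. All names, dimensions, and the PCA stand-in are assumptions, not the paper's actual adaptation machinery.

```python
import numpy as np
from sklearn.decomposition import PCA

# Toy speaker supervectors (e.g. concatenated HMM mean vectors), one per
# training speaker; names and dimensions are illustrative.
rng = np.random.default_rng(0)
n_speakers, dim = 50, 300
supervectors = rng.normal(size=(n_speakers, dim))

# Eigenvoice space: a low-dimensional basis over the training speakers.
pca = PCA(n_components=10).fit(supervectors)
train_weights = pca.transform(supervectors)

def nearest_reference(target_supervector):
    """Project the (few-seconds) target estimate into the eigenvoice space and
    return the closest training speaker, to seed a second adaptation step."""
    w = pca.transform(target_supervector[None, :])[0]
    dists = np.linalg.norm(train_weights - w, axis=1)
    return int(np.argmin(dists)), float(dists.min())

target = supervectors[7] + 0.1 * rng.normal(size=dim)       # crude target estimate
ref_id, d = nearest_reference(target)
print("seed with training speaker", ref_id)
```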